Culturomics was recently introduced as the application of high-throughput data collection and analysis to the
study of human culture. Here we make use of this data by investigating fluctuations in yearly usage frequencies
of specific words that describe social and natural phenomena, as derived from books that were published over
the course of the past two centuries. We show that the determination of the Hurst parameter by means of fractal
analysis provides fundamental insights into the nature of long-range correlations contained in the culturomic
trajectories, and by doing so, offers new interpretations as to what might be the main driving forces behind the
examined phenomena. Quite remarkably, we find that social and natural phenomena are governed by fundamentally
different processes. While natural phenomena have properties that are typical for processes with persistent
long-range correlations, social phenomena are better described as nonstationary, on-off intermittent, or Lévy
walk processes.

PACS numbers: 05.40.-a, 05.45.-a, 89.65.-s, 89.75.-k 



1. INTRODUCTION 

Observational data are often very complex, appearing without any structure or pattern in either
time or space. Examples of such observations can be found across the whole spectrum of the social
and natural sciences, ranging from economics [1] to physics [2], biology [3], and medicine [4].
The origins of observed irregular behavior, however, are not always clear. Roughly five decades
ago, deterministic chaos was discovered [5] and quickly rose to prominence as a possible mechanism
of inherent unpredictability and complexity [6]. Yet the strict criteria for declaring
deterministic chaos in observed data [7], most notably the satisfaction of criteria for
stationarity and determinism [2], and the verification of exponential divergence [9, 10], are
rarely satisfied. In response, attention has begun to shift from chaos to noise and random
processes as alternate [11] (or, in many cases, as even more probable) sources of irregularity.
While the theory of deterministic chaos relies on nonlinear dynamical systems with typically only
a few degrees of freedom, the analysis of stochastic processes, especially those that yield data
with scale invariance, relies on random fractal theory [12] or its generalization, multifractal
theory [13]. Indeed, investigations based on these theoretical foundations may provide an elegant
statistical characterization of a broad range of heterogeneous phenomena [14], and in this paper,
it is our goal to extend this theory to culturomics, as recently introduced in [15].

Culturomics, and the study of human culture in general, seemingly has little to do with
deterministic chaos and fractals. However, quantitative analyses of various aspects of human
culture have become increasingly popular; examples include the study of human mobility patterns
[16–18], the spread of infectious diseases and malware [19–24], the dynamics of online popularity
[25], social movement [26] and language [27–29], and even tennis [30]. This progress is driven not
only by important advances in theory and modeling, but also by the increasing availability of vast
amounts of data and knowledge, also referred to as metaknowledge [31], which allows scientists to
apply advanced methods of analysis on a large scale [32]. The seminal study [15] was accompanied by
the release of a vast amount of data comprised of metrics derived from ~4% of books ever published
(over five million in total), and it was this release that made the present study, i.e., the
application of random fractal theory, possible. The data are available at ngrams.googlelabs.com as
counts of n-grams that appeared in a given corpus of books published in each year. An n-gram is
made up of a series of n 1-grams, and a 1-gram is a string of characters uninterrupted by a space.
Note that a 1-gram is not necessarily a word, for it may be a number or a typo as well. Besides the
counts of individual n-grams, the total counts of n-grams contained in each corpus of books in a
given year are also provided, from which yearly usage frequencies can be obtained.
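As a minimal sketch of the last step, the yearly usage frequency of an n-gram can be taken as its count divided by the total n-gram count of the same year. The function name, the dictionary layout and the numbers below are illustrative assumptions, not the format or content of the released data files.

```python
from typing import Dict


def yearly_usage_frequency(ngram_counts: Dict[int, int],
                           total_counts: Dict[int, int]) -> Dict[int, float]:
    """Return count / total for every year that has a positive total count."""
    frequencies = {}
    for year in sorted(ngram_counts):
        total = total_counts.get(year, 0)
        if total > 0:  # years without a usable total are skipped
            frequencies[year] = ngram_counts[year] / total
    return frequencies


# Made-up counts for illustration only; these are NOT values from the corpus.
example_counts = {1900: 120, 1901: 150, 1902: 90}
example_totals = {1900: 1_000_000, 1901: 1_200_000, 1902: 900_000}
example_freq = yearly_usage_frequency(example_counts, example_totals)
```

Normalizing by the yearly total is what makes trajectories from different years comparable, since the number of digitized books grows strongly over the period considered.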

In this paper, we show what new insights are attainable by applying random fractal theory to this
vast culturomic data set. Our goal is to try and go beyond the interpretations of trajectories
provided in [15] by means of an accurate determination of scaling parameters [33], and in
particular the Hurst parameter H, which enables us to characterize the nature of correlations
(memory), if any, contained in the irregular time series. In general, data with long-range
correlations are an important subclass of 1/f^α noise [34–36], which is characterized by a
power-law decaying power spectral density, and whose dimensionality cannot be reduced by principal
component analysis since the rank-ordered eigenvalue spectrum also decays as a power law [37].
Processes that generate time series with such properties are said to have antipersistent
correlations if 0 < H < 1/2, are memoryless or have only short-range correlations if H = 1/2, and
have persistent long-range correlations (long memory) if 1/2 < H < 1 [12]. Moreover, values of
H > 1 are possible as well; these values, however, are characteristic of nonstationary processes
or rather special stationary processes such as on-off intermittency with power-law distributed on
and/or off periods and Lévy walks [10]. (Note that the latter should not be confused with Lévy
flights, which are random processes consisting of many independent steps, and are thus memoryless
with H = 1/2.) Prominent examples where 1/f^α noise was recently observed and quantified include
DNA sequences [38, 39], human cognition [40] and coordination [41], posture [42], cardiac dynamics
[43–46], as well as the distribution of prime numbers [47], to name but a few.
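The classification of processes by their Hurst parameter given above can be summarized in a few lines of code. This is only an illustrative sketch: the function name and the returned labels are assumptions, and the boundary value H = 1, which the text does not assign to any class, is reported separately.

```python
def classify_hurst(h: float) -> str:
    """Map a Hurst parameter to the regime named in the text."""
    if h <= 0:
        raise ValueError("H must be positive")
    if h < 0.5:
        return "antipersistent correlations"
    if h == 0.5:
        return "memoryless or short-range correlations"
    if h < 1:
        return "persistent long-range correlations"
    if h == 1:
        return "boundary case, not covered by the classification"
    # H > 1: the value alone cannot separate these three possibilities
    return "nonstationary, on-off intermittent, or Levy walk-like"


label_natural = classify_hurst(0.69)
label_social = classify_hurst(1.11)
```

In practice an estimated H comes with an error bar, so values close to 1/2 or 1 should be interpreted with that uncertainty in mind rather than by exact comparison.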

Despite the many successful attempts at assessing long-range correlations in complex time series,
for example by means of detrended fluctuation analysis [48] as well as many other methods [9, 13],
care should be exercised in their interpretation, particularly if one is faced with relatively
short time series that contain trends [49], nonstationarity [50], or signs of rhythmic activity
[51, 52]. Although it is obviously impossible to make general statements concerning these
properties for all the n-grams contained in the corpus of the over five million digitized books,
which amount to over two billion culturomic trajectories, it is clear that the time series are
short, comprising a little more than 200 points corresponding to the two centuries considered
(more precisely, from year 1770 to 2007), and that many will inevitably contain strong trends
[15]. In order to overcome the difficulties and pitfalls associated with the analysis of such time
series [10], besides the traditional detrended fluctuation analysis (DFA), we also use an adaptive
fractal analysis (AFA), which is based on nonlinear adaptive multiscale decomposition. We use
these methods to determine the Hurst parameter H for several 1-grams that are representative of
social and natural phenomena. Examples of words that we focus on include war, unemployment,
hurricane and earthquake (see Tables I and II for the complete list), and we find that those that
describe social phenomena (war, unemployment, etc.) in general have different scaling properties
than those describing natural phenomena (hurricane, earthquake, etc.). Our results can be
corroborated aptly with arguments from real life, and they fit nicely with the declared goal of
culturomics, which is to extend the boundaries of scientific inquiry to a wide array of new
phenomena.

The remainder of this paper is organized as follows. In the 
next section we present the main results, in section 3 we sum- 
marize them and discuss their potential implications, while in 
the appendix we describe the details of fractal analysis. 



2. RESULTS 

2.1. Natural phenomena 

We start by presenting the results of the adaptive fractal analysis for natural phenomena. In
figure 1 we first plot in panel (a) the original time series (thin blue line) and the estimated
trend (thick red line) for the 1-gram "earthquake". The detrended data are presented in panel (b).
It can be observed that overall the trend is very modest and simple, increasing only slightly
towards the present day. Using equation (4), the Hurst parameter can be estimated from the slope
of the F(w) versus w dependence on a double log scale. In panel (c), we show that the analysis of
detrended data yields H = 0.65, while in panel (d), we show that H = 0.75 if the original data are
used as input. Both calculations produce similar results, showing a very modest slope, and rely on
statistically robust scaling. Based on the meaning of the Hurst parameter, the fractal analysis of
the culturomic trajectory for "earthquake" reveals that this phenomenon has persistent long-range
correlations.

FIG. 1: Adaptive fractal analysis of the usage frequency of the 1-gram "earthquake" in the corpus
of English books. The Hurst parameter, as obtained from the detrended data, is H = 0.65. a) The
variation of the usage frequency of "earthquake" with time. The blue (thin) line depicts the
original data, while the red (thick) line depicts the estimated trend (using a window of length
101). b) Detrended data, i.e., the difference between the blue and red curves in panel a). c) Best
fit to the F(w) versus w dependence for detrended data on a double log scale yields H = 0.65.
d) Best fit to the F(w) versus w dependence for original data on a double log scale yields
H = 0.75.

As another example, we show in figure 2 the same analysis for the 1-gram "hurricane". Unlike the
"earthquake" trajectory, the trend for "hurricane" is more pronounced. It has a strong upwards
component, especially in the last couple of decades. Hence, it can be expected that the
discrepancy between the two estimated H values for the original and detrended data will be
somewhat larger than that for the 1-gram "earthquake" analyzed in figure 1. This expectation is
indeed confirmed by comparing panels (c) and (d), from which it follows that for the detrended
data H = 0.70, while for the original data H = 0.85. Still, both results robustly classify
"hurricane" as a phenomenon with persistent long-range correlations, thus adding to the evidence
that this may be valid in general for natural phenomena.

FIG. 2: Adaptive fractal analysis of the usage frequency of the 1-gram "hurricane" in the corpus
of English books. The Hurst parameter, as obtained from the detrended data, is H = 0.70. a) The
variation of the usage frequency of "hurricane" with time. The blue (thin) line depicts the
original data, while the red (thick) line depicts the estimated trend (using a window of length
101). b) Detrended data, i.e., the difference between the blue and red curves in panel a). c) Best
fit to the F(w) versus w dependence for detrended data on a double log scale yields H = 0.70.
d) Best fit to the F(w) versus w dependence for original data on a double log scale yields
H = 0.85.



To test this hypothesis more thoroughly, we have performed the same analysis as depicted in
figures 1 and 2, along with the detrended fluctuation analysis (DFA), for thirteen other 1-grams
that are characteristic of natural phenomena. Although there may be some disagreement as to what
terms are characteristic of natural phenomena, and other 1-grams as well as n-grams could be
suggested and analyzed, we consider our selection to be sufficiently representative for this
study. Supporting this assumption are the results presented in table I, which point robustly
towards the conclusion that natural phenomena in general can be classified as processes with
persistent long-range correlations. More specifically, for detrended data, we find that all Hurst
parameters estimated with AFA are within the 1/2 < H < 1 range, with an average of H = 0.69, which
leads us to the mentioned conclusion. Results obtained for original data (before detrending, not
shown), on the other hand, leave a bit more room for discussion. There, for certain 1-grams, like
"mudslide" and "flooding", the value of H is larger than one. This suggests that the data would be
more appropriately described as being either nonstationary, on-off intermittent, or Lévy
walk-like. Such a discussion, however, would be to a large degree baseless, as the upward trends
occurring towards the present time in most n-grams describing natural phenomena must be properly
taken into account. The observed trends may be considered a straightforward consequence of the
fact that we have more and more data readily available on natural phenomena, which is due to
advancements in measuring techniques as well as the increasingly global reach of the Internet.
Modern data collection and telecommunication technologies have raised our awareness, in general,
of natural phenomena, and as a result, it is reasonable to expect this increased awareness to be
reflected in an increase of occurrences in the corpus. Note, however, that similar arguments can
be raised for other fields and trivia (e.g. celebrity gossip, popular culture) as well, and thus
one could argue that, in relative terms, the usage frequencies should not necessarily increase as
a result.

                  Hurst parameter (H)
1-gram            AFA            DFA
avalanche         0.63 ± 0.06    0.79 ± 0.06
comet             0.60 ± 0.03    0.73 ± 0.04
drought           0.81 ± 0.05    0.69 ± 0.09
earthquake        0.65 ± 0.02    0.72 ± 0.03
erosion           0.85 ± 0.06    0.86 ± 0.08
fire              0.67 ± 0.05    0.70 ± 0.03
flooding          0.85 ± 0.06    0.72 ± 0.08
hurricane         0.70 ± 0.03    0.69 ± 0.08
landslide         0.66 ± 0.05    0.41 ± 0.20
life              0.62 ± 0.03    0.65 ± 0.06
lightning         0.63 ± 0.03    0.70 ± 0.03
mudslide          0.80 ± 0.02    0.58 ± 0.28
tornado           0.59 ± 0.02    0.64 ± 0.06
tsunami           0.81 ± 0.05    0.66 ± 0.03
typhoon           0.55 ± 0.02    0.50 ± 0.09

TABLE I: Hurst parameters H, as obtained for the detrended data of all fifteen considered 1-grams
describing natural phenomena. The left column lists results as obtained with the adaptive fractal
analysis (AFA), while the right column lists results as obtained with the detrended fluctuation
analysis (DFA). The range of values as obtained by AFA is 0.55 < H < 0.85, with an average over
all fifteen considered 1-grams equalling H = 0.69. With DFA we obtain 0.41 < H < 0.85 and an
average of H = 0.67.
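The summary statistics quoted for Table I can be recomputed directly from the tabulated central values. The snippet below is a simple arithmetic check, with the values transcribed from the table and the variable names chosen for illustration; the error bars are ignored.

```python
# Central values of H for the fifteen natural-phenomena 1-grams, as listed in Table I.
afa = {"avalanche": 0.63, "comet": 0.60, "drought": 0.81, "earthquake": 0.65,
       "erosion": 0.85, "fire": 0.67, "flooding": 0.85, "hurricane": 0.70,
       "landslide": 0.66, "life": 0.62, "lightning": 0.63, "mudslide": 0.80,
       "tornado": 0.59, "tsunami": 0.81, "typhoon": 0.55}
dfa = {"avalanche": 0.79, "comet": 0.73, "drought": 0.69, "earthquake": 0.72,
       "erosion": 0.86, "fire": 0.70, "flooding": 0.72, "hurricane": 0.69,
       "landslide": 0.41, "life": 0.65, "lightning": 0.70, "mudslide": 0.58,
       "tornado": 0.64, "tsunami": 0.66, "typhoon": 0.50}

mean_afa = sum(afa.values()) / len(afa)   # rounds to 0.69
mean_dfa = sum(dfa.values()) / len(dfa)   # rounds to 0.67
range_afa = (min(afa.values()), max(afa.values()))
range_dfa = (min(dfa.values()), max(dfa.values()))
```

The averages and the AFA range agree with the caption. The largest tabulated DFA value is 0.86 (erosion), marginally above the upper bound of 0.85 quoted in the caption.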



2.2. Social phenomena 

Turning to social phenomena, we will show that the problems discussed for natural phenomena are in
some cases amplified, but more importantly, that social phenomena, apart from rare exceptions,
cannot be classified solely as processes with persistent long-range correlations.

First, we present the adaptive fractal analysis for the 1-gram "war" in figure 3. The original
data depicted by the thin blue line in panel (a) clearly reflect historical events, as World Wars
I & II generate two large peaks that more or less dwarf the usage frequencies reported in other
decades. This observation goes hand in hand not just with the magnitude of the two World Wars, but
also with the increase in the usage frequency of "war" in the published literature at that time.
In agreement with the historical events is the estimated trend line depicted by the thick red line
in panel (a). However, even after the detrending, the resulting culturomic trajectory still
clearly reflects history in that the periods of World Wars I & II stand out from the rest, as can
be inferred from the curve depicted in panel (b). The Hurst parameters H determined using the
detrended and original data [presented in panels (c) and (d)] have similar values (H = 1.09 for
the detrended data and H = 1.15 for the original data). As a result, both classify "war" as either
a nonstationary, on-off intermittent, or a Lévy walk-like process.

FIG. 3: Adaptive fractal analysis of the usage frequency of the 1-gram "war" in the corpus of
English books. The Hurst parameter, as obtained from the detrended data, is H = 1.09. a) The
variation of the usage frequency of "war" with time. The blue (thin) line depicts the original
data, while the red (thick) line depicts the estimated trend (using a window of length 101).
b) The detrended data, i.e., the difference between the blue and red curves in panel a). c) Best
fit to the F(w) versus w dependence for detrended data on a double log scale yields H = 1.09.
d) Best fit to the F(w) versus w dependence for original data on a double log scale yields
H = 1.15.

FIG. 4: Adaptive fractal analysis of the usage frequency of the 1-gram "unemployment" in the
corpus of English books. The Hurst parameter, as obtained from the detrended data, is H = 1.32.
a) The variation of the usage frequency of "unemployment" with time. The blue (thin) line depicts
the original data, while the red (thick) line depicts the estimated trend (using a window of
length 101). b) Detrended data, i.e., the difference between the blue and red curves in panel a).
c) Best fit to the F(w) versus w dependence for detrended data on a double log scale yields
H = 1.32. d) Best fit to the F(w) versus w dependence for original data on a double log scale
yields H = 1.39.

Another illustrative example of fractal analysis is presented in figure 4, where we examine the
1-gram "unemployment". A crucial distinction from "war", as well as from all the considered
natural phenomena, is that unemployment was nonexistent, or at least it was not mentioned, in the
literature prior to 1900, which is clearly inferable from the original data depicted by the thin
blue line in panel (a). With the coming of age of the industrial revolution, the job market began
to take shape, and with it came, rather inevitably it seems, the problem of unemployment. The
trend line depicted by the thick red line in panel (a) clearly captures this fact. Moreover, we
note that the first broad peak in the plot starts at around 1930, and thus correlates well with
the Great Depression, while the second broad peak starts at around 1970, and thus correlates with
the period of US economic stagnation and high inflation that was linked with the Middle Eastern
oil crisis. After detrending, the situation is of course only marginally improved (in terms of
assuring a more stationary record), as can be concluded from the curve depicted in panel (b). The
Hurst parameters, equalling H = 1.32 for the detrended data (c) and H = 1.39 for the original data
(d), both clearly reflect nonstationarity, and accordingly, "unemployment" can be considered the
result of such a process.

As in the case of natural phenomena (see table I), we also performed the same fractal analysis as
for "war" and "unemployment", along with the detrended fluctuation analysis (DFA), for thirteen
other social phenomena. The results are presented in table II. It can be observed that the large
majority of considered 1-grams have H > 1 (AFA), which indicates that social phenomena are most
likely to be either nonstationary, on-off intermittent, or Lévy walk-like processes. This
conclusion is obtained irrespective of whether detrending is performed or not, although the
average Hurst parameter for detrended data, equalling H = 1.11 (AFA), is smaller than that
obtained for original data (before detrending, not shown), which is H = 1.26. This discrepancy,
however, is likely due to the successful removal of some level of nonstationarity that is in
general characteristic of social phenomena (more so than of natural phenomena). We would like to
note, however, that not all H > 1 occurrences should be, by default, attributed to nonstationarity
in the trajectories. While visual inspection may lend support to such a conclusion, as was the
case for results presented in figure 4, in general the H value alone cannot distinguish between
nonstationary, on-off intermittent or Lévy walk-like processes. In fact, the time series are too
short for a robust assessment of the more precise nature of the examined social phenomena. At a
glance, and since this is indeed most common, it seems convenient to attribute H > 1 in social
phenomena to nonstationarity, yet only additional future data can enable us to differentiate
whether the peaks are part of an on-off intermittent process with power-law distributed on and/or
off events, or if they are part of a Lévy walk. Lastly, we would also like to point out that of
course not all phenomena that can be considered as social will have H > 1. Examples include
1-grams such as "famine" or "Christian", which for the largest part of recorded human history
were either directly related to natural phenomena (severe droughts, flooding, or other phenomena
that negatively affected a season's yield of vegetables, crops, grass, and animal population,
hence leading to famine) or have been an integral part of human culture for a long time (prior to
the start of the culturomic trajectories). Moreover, social topics that are of little interest
will not garner much attention, and are as such also unlikely to have usage frequencies with
H > 1. The social phenomena where the human factor has played a key role recently and which are
reasonably popular, however, all share features that are characteristic of processes with H > 1.
In fact, it seems just to conclude that the more recent a social phenomenon is (unemployment,
recession, democracy), the higher its Hurst parameter is likely to be (see table II). This agrees
nicely also with the recent observation of bursts and heavy tails in human dynamics [53].

                  Hurst parameter (H)
1-gram            AFA            DFA
Christian         0.85 ± 0.05    0.95 ± 0.08
communism         1.32 ± 0.04    1.44 ± 0.05
crisis            1.15 ± 0.05    1.13 ± 0.08
democracy         1.18 ± 0.02    1.07 ± 0.07
education         1.04 ± 0.05    1.09 ± 0.13
environment       1.13 ± 0.04    1.24 ± 0.08
famine            0.74 ± 0.02    0.66 ± 0.06
malnutrition      1.10 ± 0.07    1.08 ± 0.11
politics          1.14 ± 0.03    0.99 ± 0.06
population        1.01 ± 0.06    0.98 ± 0.10
recession         1.33 ± 0.05    1.06 ± 0.07
socializing       1.28 ± 0.07    1.28 ± 0.09
stock             1.01 ± 0.05    0.99 ± 0.11
unemployment      1.32 ± 0.04    1.28 ± 0.04
war               1.09 ± 0.03    0.99 ± 0.12

TABLE II: Hurst parameters H, as obtained for the detrended data of all fifteen considered 1-grams
describing social phenomena. The left column lists results as obtained with the adaptive fractal
analysis (AFA), while the right column lists results as obtained with the detrended fluctuation
analysis (DFA). The range of values as obtained by AFA is 0.74 < H < 1.33, with the average over
all fifteen considered 1-grams equalling H = 1.11. With DFA we obtain 0.66 < H < 1.44 and an
average of H = 1.08.



3. DISCUSSION 

By applying fractal analysis based on DFA and AFA to culturomic trajectories of 1-grams describing
typical social and natural phenomena over the past two centuries, we have found that they obey
different scaling laws. As we will discuss in what follows, our findings agree nicely with
existing theory and expectations, as well as offer new interpretations as to what might be the
main driving forces behind the examined phenomena.

We find that natural phenomena have properties that are typical of processes that generate
persistent long-range correlations, as evidenced by the Hurst parameter being in the range
0.55 < H < 0.85, with an average over all fifteen considered 1-grams equalling H = 0.69 (AFA). The
prevalence of long-term memory in natural phenomena compels us to conjecture that the long-range
correlations in the usage frequency of the corresponding terms are predominantly driven by
occurrences in nature of those phenomena. Using data from five million digitized books to arrive
at this understanding certainly supports the declared goal of culturomics and lends strong support
to its core principles. Owing to this memory, and of course by using the statistical data
available, we know, based on the Gutenberg-Richter law [54], that in the United Kingdom, for
example, an earthquake of 3.7–4.6 on the Richter scale is likely to happen every year, an
earthquake of 4.7–5.5 is due approximately every 10 years, while an earthquake of 5.6 or larger is
bound to happen every 100 years [55]. Similar "statistical predictions" are available for tsunamis
and many other, if not all, natural phenomena. On a more personal level, this also agrees with how
we naturally develop an understanding for the weather and related natural phenomena for the region
we live in.

Social phenomena, on the other hand, have the Hurst pa- 
rameter in the range 0.74 < H < 1.33, with an average over 
all fifteen considered 1-grams equalling H = 1.11 (AFA). 
This is indicative of nonstationary processes, or stationary 
processes like on-off intermittency with power-law distributed 
on and/or off periods or Lévy walks. While our analysis does 
not allow distinction between these three options, it is clear 
that all these processes are fundamentally different from those 
describing natural phenomena. So while it is common to hear 
speculations about possible average periods regarding social 
phenomena - for instance, that there may be an average pe- 
riod between major wars or stock market crashes - our anal- 
ysis suggests this is not the case, and that social phenomena 
tend to follow different scaling laws than natural phenomena. 
Such a difference is not unexpected, as social phenomena are, 
by nature, more complex than natural phenomena; the former 
depend on political, economic, and social forces, as well as on 
natural phenomena. The results of this additional complexity 
can be seen in our fractal analysis of a set of culturomic tra- 
jectories. 

In summary, we hope to have successfully demonstrated that the data made available through the
Culturomics project [15], when coupled with advanced methods of analysis, offer fascinating
opportunities to explore human culture in the broadest possible sense.
APPENDIX A. METHODS

A.1. Nonlinear adaptive multiscale decomposition

Nonlinear adaptive multiscale decomposition starts by partitioning a time series into segments of
length $w = 2n + 1$, where neighboring segments overlap by $n + 1$ points, thus introducing a time
scale of $\frac{w+1}{2}\tau = (n + 1)\tau$, where $\tau$ is the sampling time. Each segment is then
fitted with the best polynomial of order $M$. Note that $M = 0$ and $M = 1$ correspond to
piece-wise constant and linear fitting, respectively. We denote the fitted polynomials for the
$i$-th and $(i + 1)$-th segments by $y^{(i)}(l_1)$ and $y^{(i+1)}(l_2)$, respectively, where
$l_1, l_2 = 1, \dots, 2n + 1$. We then define the fitting for the overlapped region as

$$ y^{(c)}(l) = w_1\, y^{(i)}(l + n) + w_2\, y^{(i+1)}(l), \qquad l = 1, \dots, n + 1, \tag{1} $$

where $w_1 = 1 - \frac{l-1}{n}$ and $w_2 = \frac{l-1}{n}$ can be written as $(1 - d_j/n)$ for
$j = 1, 2$, and where $d_j$ denotes the distances between the point and the centers of $y^{(i)}$
and $y^{(i+1)}$, respectively. This means that the weights decrease linearly with the distance
between the point and the center of the segment. Such a weighting ensures symmetry and effectively
eliminates any jumps or discontinuities around the boundaries of neighboring segments. In fact,
the scheme ensures that the fitting is continuous everywhere, is smooth at the non-boundary
points, and has the right- and left-derivatives at the boundary. Moreover, since it can deal with
an arbitrary trend without a priori knowledge, it can remove nonstationarity, including baseline
drifts and motion artifacts [56], and the procedure may also be used as either a high-pass or a
low-pass filter with noise-removal properties superior to those of linear filters, wavelet
shrinkage, or chaos-based noise reduction schemes [57].
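The decomposition described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the implementation used for the results: it is restricted to linear fits (M = 1), the names `linear_fit` and `adaptive_trend` are hypothetical, and the treatment of the series ends (first and last half-segment taken from a single fit, plus one extra end-anchored segment when the length does not match the segment grid) is a choice made here that the text does not specify.

```python
from typing import List, Sequence


def linear_fit(values: Sequence[float]) -> List[float]:
    """Least-squares straight line through (0, v0), (1, v1), ...; returns fitted values."""
    m = len(values)
    t_mean = (m - 1) / 2.0
    v_mean = sum(values) / m
    s_tt = sum((t - t_mean) ** 2 for t in range(m))
    s_tv = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    slope = s_tv / s_tt
    return [v_mean + slope * (t - t_mean) for t in range(m)]


def adaptive_trend(x: Sequence[float], w: int) -> List[float]:
    """Smooth global trend from overlapping segments of length w = 2n + 1 (M = 1)."""
    if w < 3 or w % 2 == 0:
        raise ValueError("w must be an odd integer >= 3")
    N = len(x)
    if N < w:
        raise ValueError("the series must contain at least w points")
    n = (w - 1) // 2
    starts = list(range(0, N - w + 1, n))   # neighbours overlap by n + 1 points
    if starts[-1] != N - w:
        starts.append(N - w)                # assumption: extra segment covers the tail
    fits = [linear_fit(x[s:s + w]) for s in starts]
    centres = [s + n for s in starts]

    trend = [0.0] * N
    for j in range(centres[0] + 1):         # before the first centre: first fit only
        trend[j] = fits[0][j]
    for j in range(centres[-1], N):         # after the last centre: last fit only
        trend[j] = fits[-1][j - starts[-1]]
    for k in range(len(starts) - 1):        # blend neighbouring fits as in equation (1)
        a, b = centres[k], centres[k + 1]
        for j in range(a, b + 1):
            w2 = (j - a) / (b - a)          # equals (l - 1) / n for regular spacing
            w1 = 1.0 - w2
            trend[j] = (w1 * fits[k][j - starts[k]]
                        + w2 * fits[k + 1][j - starts[k + 1]])
    return trend


# Usage: a purely linear series is reproduced by its own trend.
line = [2.0 * t + 1.0 for t in range(50)]
line_trend = adaptive_trend(line, 11)
```

Because the weights vary linearly between the centres of neighbouring segments, the blended trend passes continuously from one local fit to the next, which is the property emphasized in the text.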



A.2. Fractal analysis 

Based on the described adaptive decomposition, a fractal analysis can be conducted as follows. Let
$\{x_1, x_2, \dots, x_n\}$ be a stationary stochastic process with mean $\bar{x}$ and
autocorrelation function of type

$$ r(k) \sim k^{2H-2} \qquad \text{as } k \to \infty, \tag{2} $$

where $H$ is the Hurst parameter. This is often called an increment process, and its power
spectral density is $1/f^{2H-1}$. The integral of the increment process:

$$ u(i) = \sum_{k=1}^{i} \left( x_k - \bar{x} \right), \qquad i = 1, 2, \dots, n, \tag{3} $$



on the other hand, is called a random walk process, and its power spectral density is
$1/f^{2H+1}$. Starting from an increment process, similarly to detrended fluctuation analysis
[48], we first construct a random walk process using equation (3). If, however, the original data
can already be classified as a random walk-like process, then this step is not necessary, although
for ideal fractal processes there is no penalty even if this step is done. Next, for a window size
$w$, we determine, for the random walk process $u(i)$ (or the original process if it is already a
random walk-like process), a global trend $v(i)$, $i = 1, 2, \dots, N$, where $N$ is the length of
the walk. The residual, $u(i) - v(i)$, characterizes fluctuations around the global trend, and its
variance yields the Hurst parameter $H$ according to

$$ F(w) = \left[ \frac{1}{N} \sum_{i=1}^{N} \left( u(i) - v(i) \right)^2 \right]^{1/2} \sim w^{H}. \tag{4} $$

FIG. 5: Scaling analysis of fractional Gaussian noise processes of the same length as the 1-gram
data (240 points). Blue (circles) and red (diamonds) curves depict results as obtained by means of
detrended fluctuation analysis (DFA) and adaptive fractal analysis (AFA), respectively, for three
different values of H. It can be observed that both methods yield consistent results, despite the
short length of the examined time series.



The validity of equation (4) can be proven if one starts from an increment process with the Hurst
parameter equal to $H$. Using Parseval's theorem [9], the variance of the residual data
corresponding to a window size $w$ may be equated to the total power $P$ in the frequency range
$(f_w, f_{\mathrm{cutoff}})$ as

$$ P \sim \int_{f_w}^{f_{\mathrm{cutoff}}} \frac{1}{f^{2H+1}}\, df \sim \frac{1}{2H} \left( f_w^{-2H} - f_{\mathrm{cutoff}}^{-2H} \right), \tag{5} $$

where $f_w = 1/w$, and $f_{\mathrm{cutoff}}$ is the highest frequency of the data. When
$f_w \ll f_{\mathrm{cutoff}}$, we see that equation (4) has to be valid. In fact, the above
treatment makes it clear that even if we start from a random walk process with the Hurst exponent
equal to $H$, integration will give the process a spectrum of $1/f^{2H+1+2} = 1/f^{2(H+1)+1}$, and
therefore, the final Hurst parameter will be simply $H + 1$. This in turn indicates that there is
indeed no penalty if one uses equation (3) when the data are already a random walk-like process.
Note that the proposed approach, if needed, can be readily extended and applied successfully to
multifractal as well as higher dimensional data.
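The limiting step that leads from equation (5) to equation (4) can be written out explicitly. The lines below are a short worked derivation, assuming $H > 0$ so that the cutoff term becomes negligible, and using $f_w = 1/w$ together with the fact that $F(w)$ is the square root of the residual variance:

```latex
P \sim \frac{1}{2H}\left( f_w^{-2H} - f_{\mathrm{cutoff}}^{-2H} \right)
  \;\approx\; \frac{1}{2H}\, f_w^{-2H}
  \;=\; \frac{w^{2H}}{2H}
  \qquad \text{for } f_w \ll f_{\mathrm{cutoff}},
\qquad\Longrightarrow\qquad
F(w) = \sqrt{P} \sim w^{H}.
```

The prefactor $1/(2H)$ does not depend on $w$, so it only shifts the $F(w)$ versus $w$ curve on a double log scale and leaves its slope $H$ unchanged.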

The described fractal analysis approach, which we will refer to as adaptive fractal analysis
(AFA), in general yields results that are consistent with the traditionally used detrended
fluctuation analysis [48] (DFA), as can be concluded from results presented in figure 5.
Nevertheless, especially for processes having H > 1 [10], AFA may yield better scaling, which is
why, although we analyze the culturomic trajectories with both methods, we rely on the results of
AFA for final interpretation. The potential advantage of adaptive fractal analysis over detrended
fluctuation analysis is due to the fact that the trend for each window of size w obtained by AFA
is smooth, while that obtained by DFA may change abruptly at the boundary of neighboring segments.
For short nonstationary time series this may prove favorable for obtaining better scaling in the
F(w) versus w dependence.